I ♥ Unicode

Using UTF-8 in Debian

You probably came to this page because you have a character encoding problem of some kind. With the current proliferation of character encodings this is unfortunately way too common. The most effective method to avoid these problems is to pick one character encoding and stick to it. However, if you stick to one, you have to be sure that it is functional and compatible enough for both your future and current needs. Pretty much the only character set that supports all languages is Unicode. One of its encodings, UTF-8, is similar enough to ASCII to be usable right now.

This short guide assumes Debian sid (unstable/experimental) but many parts will be useful in other systems that are GNU libc based.

About Unicode and UTF-8

To preempt confusion, here is a short explanation of what Unicode and UTF-8 are. They are very different things: Unicode is a mapping between numbers (code points) and characters, while UTF-8 is a way to encode such numbers into bytes. For example, Unicode specifies that the symbol for ‘infinity’ (∞) shall have number 8734, usually denoted in hexadecimal as U+221E. If we encode this in UTF-8 we get the byte sequence E2 88 9E. UTF-8 sequences can be any length from one to four bytes (the original design allowed up to six, but RFC 3629 restricts UTF-8 to four). Sequences longer than three bytes are needed only for characters outside the Basic Multilingual Plane.
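The byte sequence above can be checked from any shell. The following is a minimal sketch, assuming a POSIX printf and od are available; utf8_bytes is a name made up for this example.

```shell
# utf8_bytes is a hypothetical helper: print the bytes of its argument in hex.
utf8_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

# The infinity sign, written as three octal escapes (E2 88 9E),
# so that the example does not depend on the encoding of this file.
infinity=$(printf '\342\210\236')

utf8_bytes "$infinity"    # prints e2889e
utf8_bytes "A"            # plain ASCII stays a single byte: prints 41
```

The second call shows why UTF-8 is so close to ASCII: every ASCII character is encoded as the same single byte it always was.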

Here is a small example of what UTF-8 and Unicode are useful for. No other character set can represent both the artist's name and the lyrics at the same time.

On to the more practical stuff.

Your locales

The first step in getting UTF-8 to work is configuring your locales. If you installed a recent version of Debian (such as etch) or Ubuntu, locales should be set up to use UTF-8 by default. However, if you did a custom install without locales, read on.

In Debian, locales have to be generated before you can use them. Append this line to your /etc/locale.gen:

en_US UTF-8

Remember to remove the line that indicates that this file is autogenerated. After this, run the locale-gen script. This should cover everything that has to be done as the root user.
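To check that the locale was actually generated, you can look for it in the output of locale -a. This is a small sketch; has_utf8_locale is a hypothetical helper, and the pattern assumes the en_US locale used in this guide.

```shell
# has_utf8_locale is a hypothetical helper: it reads a list of locale names
# on standard input (one per line, as printed by `locale -a`) and succeeds
# if en_US with a UTF-8 codeset is among them.
# `locale -a` usually spells the name en_US.utf8.
has_utf8_locale() {
    grep -Eiq '^en_US\.utf-?8$'
}

if locale -a 2>/dev/null | has_utf8_locale; then
    echo "en_US.UTF-8 is available"
else
    echo "en_US.UTF-8 is missing; check /etc/locale.gen and rerun locale-gen"
fi
```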

Once this is done, to make sure all applications are started with the correct locale you should create a small file (~/.utf8) that sets the required environment variable:

export LANG=en_US.UTF-8

Optionally you could set the following additional variable to make sure the ‘less’ program doesn't get confused:

export LESSCHARSET=utf-8

However, be aware that less does not handle double-width characters properly.

To get perl to deploy its considerable array of Unicode features:

export PERL_UTF8_LOCALE=1 PERL_UNICODE=AS

The next step is to ensure that these environment variables are always set. First, source this file in your shell startup scripts. For bash, add this line to your ~/.bashrc:

. ~/.utf8

Note that bash reads ~/.bashrc only for interactive shells that are not login shells. Login shells (a console or ssh login, for example) read ~/.bash_profile instead, as the manpage describes. To cover both cases, add the line to your ~/.bash_profile too.

In zsh you can add this line to your ~/.zshenv instead.

Additionally, if you use an X login manager (xdm, gdm, kdm, wdm, etc.) you sometimes have an option to set the locale when you log in. Make sure you do, because programs started from your window manager, toolbar or other desktop environment component inherit their environment from the login session.
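To see whether a given session really picked up the setting, a quick check like the following can help. It is only a sketch: effective_ctype and is_utf8_locale are names invented here, and the check looks at the locale name only, so the output of `locale charmap` remains the authoritative answer.

```shell
# effective_ctype is a hypothetical helper: print the value that decides
# the character encoding, following the usual precedence
# LC_ALL, then LC_CTYPE, then LANG.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf '%s\n' "$LC_CTYPE"
    else
        printf '%s\n' "${LANG:-}"
    fi
}

# is_utf8_locale (also hypothetical) succeeds if the given locale name
# carries a UTF-8 codeset, allowing for the utf8 spelling and for a
# trailing modifier such as @euro.
is_utf8_locale() {
    case "$1" in
        *.[Uu][Tt][Ff]-8|*.[Uu][Tt][Ff]8) return 0 ;;
        *.[Uu][Tt][Ff]-8@*|*.[Uu][Tt][Ff]8@*) return 0 ;;
        *) return 1 ;;
    esac
}

if is_utf8_locale "$(effective_ctype)"; then
    echo "UTF-8 locale in effect: $(effective_ctype)"
else
    echo "no UTF-8 locale in effect (LANG=${LANG:-unset})"
fi
```

Running this from a terminal started by your window manager is a simple way to find out whether the login manager passed the locale on.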

Configuring screen

The screen program will manage a number of permanent terminals for you which you can detach and reattach at your convenience. It supports UTF-8 modes quite nicely if you tell it to. You should add the following options to your ~/.screenrc:

defc1 off
defutf8 on

After this, restart screen but make sure that the LANG variable is set when you start it.

Configuring xterm

In the standard xterm package there's a script called uxterm that will start an X terminal just right for UTF-8 support.

Configuring mutt

mutt requires that the LANG variable is set properly.

To tell mutt that your terminal uses UTF-8, so that what you type is converted and labeled correctly in outgoing messages, append this to your .muttrc:

set charset=utf-8

Configuring irssi

The irssi IRC client can handle UTF-8 quite nicely (v0.8.10). The only setting it needs is:

/set term_charset utf-8

Don't forget to /save after setting it.

Like mutt, irssi needs the LANG environment variable.

Configuring vi

Unfortunately nvi still assumes that a character is always one byte. To edit UTF-8 encoded documents you can use vim instead.

Looking up characters

In order to use all this Unicode goodness, install the gucharmap program to look up glyphs. If you don't feel like installing hundreds of megabytes of GNOME2 libraries, try the following simple perl command line:

perl -CS -e 'for($i=160;$i<10000;$i++){print chr $i, $i%30?" ":"\n"}'

Converting stuff

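When it comes to converting, the recode package can't be beat. Note that recode converts files in place when given file names; use it as a filter, as in the last example, to leave the original untouched. Examples: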

recode l1..u8 somefile.txt
recode eucjp..u8 somefile.txt
recode l1..u8 <somefile.txt | less
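Converting a file that is already UTF-8 a second time garbles it, so it can be worth checking first. The sketch below assumes the iconv program from GNU libc, which exits with an error on input that is not valid in the source encoding; is_valid_utf8 and convert_latin1_once are names made up for this example.

```shell
# is_valid_utf8 is a hypothetical helper: succeeds if the file decodes
# cleanly as UTF-8. It relies on iconv exiting non-zero on invalid input.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# convert_latin1_once (also hypothetical) converts a Latin-1 file in place
# with recode, but only if the file is not valid UTF-8 yet.
convert_latin1_once() {
    if is_valid_utf8 "$1"; then
        echo "$1: already valid UTF-8, left alone"
    else
        recode l1..u8 "$1"
    fi
}
```

It would be called as convert_latin1_once somefile.txt. Pure ASCII files count as valid UTF-8 and are left alone as well, which is harmless because converting them would not change a single byte.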

Fonts

For some nice Unicode X fonts (great for uxterm), try installing xfonts-efont-unicode and xfonts-efont-unicode-ib with apt-get. To use them you can cut and paste the following into your ~/.Xresources:

UXTerm.fontMenu.fontdefault.Label: Normal
UXTerm.fontMenu.font1.Label: Illegible
UXTerm.fontMenu.font2.Label: Smallest
UXTerm.fontMenu.font3.Label: Small
UXTerm.fontMenu.font4.Label: Large
UXTerm.fontMenu.font5.Label: Huge
UXTerm.fontMenu.font6.Label: Bold
UXTerm.vt100.utf8Fonts.font: -efont-fixed-medium-r-normal--14-*-*-*-c-*-iso10646-1
UXTerm.vt100.utf8Fonts.font1: nil2
UXTerm.vt100.utf8Fonts.font2: -efont-fixed-medium-r-normal--10-*-*-*-c-*-iso10646-1
UXTerm.vt100.utf8Fonts.font3: -efont-fixed-medium-r-normal--12-*-*-*-c-*-iso10646-1
UXTerm.vt100.utf8Fonts.font4: -efont-fixed-medium-r-normal--16-*-*-*-c-*-iso10646-1
UXTerm.vt100.utf8Fonts.font5: -efont-fixed-medium-r-normal--24-*-*-*-c-*-iso10646-1
UXTerm.vt100.utf8Fonts.font6: -efont-fixed-bold-r-normal--16-*-*-*-c-*-iso10646-1
UXTerm.vt100.utf8Fonts.boldFont: -efont-fixed-bold-r-normal--14-*-*-*-c-*-iso10646-1

Don't forget to run xrdb -merge ~/.Xresources.

Have fun!
